Exploring the relationship between different services of a villa and the prices of the villa¶

by (Muhammad Abdullahi Said)¶

Investigation Overview¶

In this investigation, I wanted to look at the relationship between the different services a villa offers and its price, to see whether prices are related to those services in any way.

Dataset Overview¶

The dataset contains the details of 977 villas with 38 attributes. Most variables are categorical, but area, price, age of property, locality score, project score and builder experience are numerical. Outliers such as an age of property of 122 years and builder experience greater than 250 were removed.
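The outlier removal described above can be sketched without any libraries. This is only an illustration: the rows are made up, the age cut-off of 100 years is an assumption (the text only says the 122-year value was dropped), and only the two column names and the 250 threshold come from the overview.

```python
# Hypothetical rows; only the two column names come from the dataset.
rows = [
    {"age of property": 0, "builder_experience": 52},
    {"age of property": 122, "builder_experience": 186},  # implausible age
    {"age of property": 5, "builder_experience": 2022},   # implausible experience
    {"age of property": 3, "builder_experience": 186},
]

MAX_AGE = 100         # assumed cut-off, not stated in the source
MAX_EXPERIENCE = 250  # threshold mentioned in the overview


def is_plausible(row):
    """Keep rows whose age and builder experience fall within the cut-offs."""
    return (
        row["age of property"] < MAX_AGE
        and row["builder_experience"] <= MAX_EXPERIENCE
    )


cleaned = [row for row in rows if is_plausible(row)]
print(len(rows), "rows before,", len(cleaned), "rows after")
```

Filtering on explicit thresholds keeps the rule easy to audit; a different cut-off only requires changing the two constants.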


Describing the attributes¶

Here we look at the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%) and maximum of all the numerical attributes.¶

The table is transposed so that every attribute is shown as a row.¶

count mean std min 25% 50% 75% max
area 919.0 2.649709e+03 1.566466e+03 522.0 1.518500e+03 2.350000e+03 3.300000e+03 11500.0
price 919.0 2.492829e+07 2.612990e+07 1800000.0 7.650000e+06 1.950000e+07 3.150000e+07 240000000.0
status 919.0 1.327530e-01 3.394923e-01 0.0 0.000000e+00 0.000000e+00 0.000000e+00 1.0
new/resale 919.0 4.646355e-01 4.990194e-01 0.0 0.000000e+00 0.000000e+00 1.000000e+00 1.0
price_negotiable 919.0 4.570185e-02 2.089514e-01 0.0 0.000000e+00 0.000000e+00 0.000000e+00 1.0
furnished 919.0 1.751904e-01 3.803369e-01 0.0 0.000000e+00 0.000000e+00 0.000000e+00 1.0
age of property 919.0 3.165397e+00 6.302718e+00 0.0 0.000000e+00 0.000000e+00 5.000000e+00 122.0
Lift(s) 919.0 3.841132e-01 4.866497e-01 0.0 0.000000e+00 0.000000e+00 1.000000e+00 1.0
Full Power Backup 919.0 6.332971e-01 4.821668e-01 0.0 0.000000e+00 1.000000e+00 1.000000e+00 1.0
24 X 7 Security 919.0 3.090316e-01 4.623458e-01 0.0 0.000000e+00 0.000000e+00 1.000000e+00 1.0
Children's play area 919.0 7.094668e-01 4.542556e-01 0.0 0.000000e+00 1.000000e+00 1.000000e+00 1.0
Club House 919.0 5.146899e-01 5.000563e-01 0.0 0.000000e+00 1.000000e+00 1.000000e+00 1.0
Gymnasium 919.0 4.733406e-01 4.995606e-01 0.0 0.000000e+00 0.000000e+00 1.000000e+00 1.0
Swimming Pool 919.0 4.744287e-01 4.996176e-01 0.0 0.000000e+00 0.000000e+00 1.000000e+00 1.0
Sports Facility 919.0 4.047878e-01 4.911182e-01 0.0 0.000000e+00 0.000000e+00 1.000000e+00 1.0
Jogging Track 919.0 2.883569e-01 4.532447e-01 0.0 0.000000e+00 0.000000e+00 1.000000e+00 1.0
Landscaped Gardens 919.0 1.099021e-01 3.129380e-01 0.0 0.000000e+00 0.000000e+00 0.000000e+00 1.0
locality_score 919.0 8.023009e+00 9.509524e-01 5.3 8.000000e+00 8.015749e+00 8.300000e+00 9.7
project_score 919.0 8.036305e+00 2.536038e-01 6.7 8.039148e+00 8.040040e+00 8.081644e+00 8.8
builder_experience 919.0 1.881241e+02 3.268600e+02 12.0 5.200000e+01 1.859448e+02 1.862927e+02 2022.0
Rain Water Harvesting 919.0 2.763874e-01 4.474542e-01 0.0 0.000000e+00 0.000000e+00 1.000000e+00 1.0
Car Parking 919.0 3.275299e-01 4.695679e-01 0.0 0.000000e+00 0.000000e+00 1.000000e+00 1.0
Vaastu Compliant 919.0 1.566921e-01 3.637081e-01 0.0 0.000000e+00 0.000000e+00 0.000000e+00 1.0
Golf Course 919.0 3.590860e-02 1.861636e-01 0.0 0.000000e+00 0.000000e+00 0.000000e+00 1.0
Intercom 919.0 4.755169e-01 4.996721e-01 0.0 0.000000e+00 0.000000e+00 1.000000e+00 1.0
Indoor Games 919.0 1.741023e-01 3.794039e-01 0.0 0.000000e+00 0.000000e+00 0.000000e+00 1.0
Maintenance Staff 919.0 1.055495e-01 3.074275e-01 0.0 0.000000e+00 0.000000e+00 0.000000e+00 1.0
Multipurpose Room 919.0 1.436344e-01 3.509096e-01 0.0 0.000000e+00 0.000000e+00 0.000000e+00 1.0
ATM 919.0 5.223069e-02 2.226130e-01 0.0 0.000000e+00 0.000000e+00 0.000000e+00 1.0
Cafeteria 919.0 7.399347e-02 2.619028e-01 0.0 0.000000e+00 0.000000e+00 0.000000e+00 1.0
Staff Quarter 919.0 6.528836e-02 2.471685e-01 0.0 0.000000e+00 0.000000e+00 0.000000e+00 1.0
Hospital 919.0 2.611534e-02 1.595651e-01 0.0 0.000000e+00 0.000000e+00 0.000000e+00 1.0
School 919.0 2.829162e-02 1.658950e-01 0.0 0.000000e+00 0.000000e+00 0.000000e+00 1.0
Shopping Mall 919.0 4.134929e-02 1.992052e-01 0.0 0.000000e+00 0.000000e+00 0.000000e+00 1.0

  1. area: Area of the villas ranges from 522 to 11,500. As Mean > Median, it's right-skewed.
  2. price: Our target column ranges from 1.8M to 240M. As Mean > Median, it's right-skewed.
  3. new/resale: Representing whether a villa is new or a resale. It's a Categorical Variable
  4. price_negotiable: Representing whether a villa price is negotiable or not. It's a Categorical Variable
  5. furnished: Representing whether a villa is furnished or not. It's a Categorical Variable
  6. age of property: Age of property ranges from 0 to 122 years. As Mean > Median, it's right-skewed.
  7. locality_score: Locality score ranges from 5.3 to 9.7. As Mean ~ Median, it's almost normally distributed.
  8. project_score: Project score ranges from 6.7 to 8.8. As Mean ~ Median, it's almost normally distributed.
  9. builder_experience: Builder experience ranges from 12 to 2022. As Mean > Median, it's right-skewed.
  10. Gymnasium: Representing whether a villa has a gym or not. It's a Categorical Variable
  11. 24 X 7 Security: Representing whether a villa has 24hr security or not. It's a Categorical Variable
  12. Swimming Pool: Representing whether a villa has a swimming pool or not. It's a Categorical Variable
  13. Lift(s): Representing whether a villa has a lift or not. It's a Categorical Variable
  14. Jogging Track: Representing whether a villa has a jogging track or not. It's a Categorical Variable

Continuation

  1. Club House: Representing whether a villa has a club house or not. It's a Categorical Variable
  2. Landscaped Gardens: Representing whether a villa has a landscaped garden or not. It's a Categorical Variable
  3. Rain Water Harvesting: Representing whether a villa has a rainwater harvesting system or not. It's a Categorical Variable
  4. Sports Facility: Representing whether a villa has a sport facility or not. It's a Categorical Variable
  5. Car Parking: Representing whether a villa has a car park or not. It's a Categorical Variable
  6. Children's play area: Representing whether a villa has a playground for children or not. It's a Categorical Variable
  7. Full Power Backup: Representing whether a villa has a power backup or not. It's a Categorical Variable
  8. Vaastu Compliant: Representing whether a villa follows Vaastu Shastra design principles or not. It's a Categorical Variable
  9. Indoor Games: Representing whether a villa has Indoor games or not. It's a Categorical Variable
  10. Intercom: Representing whether a villa has an Intercom or not. It's a Categorical Variable
  11. Maintenance Staff: Representing whether a villa has a maintenance staff or not. It's a Categorical Variable

Continuation

  1. Shopping Mall: Representing whether a villa has a shopping mall or not. It's a Categorical Variable
  2. Cafeteria: Representing whether a villa has a cafeteria or not. It's a Categorical Variable
  3. ATM: Representing whether a villa has an ATM or not. It's a Categorical Variable
  4. Multipurpose Room: Representing whether a villa has a multipurpose room or not. It's a Categorical Variable
  5. School: Representing whether a villa has a school or not. It's a Categorical Variable
  6. Hospital: Representing whether a villa has a hospital or not. It's a Categorical Variable
  7. Golf Course: Representing whether a villa has a golf course or not. It's a Categorical Variable
  8. Staff Quarter: Representing whether a villa has a staff quarter or not. It's a Categorical Variable
  9. status: Representing whether a villa status is ready or not. It's a Categorical Variable

Creating a boxplot for all numerical columns¶

Here we use boxplots to look for outliers in each numerical attribute. From the figures below we found two attributes with outliers: age of property and builder experience.¶
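A boxplot usually flags points beyond 1.5 times the interquartile range from the box edges. The sketch below applies that rule by hand to made-up ages; the `iqr_outliers` helper and the sample values are illustrative assumptions, not the notebook's own code.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up ages in years; 122 mimics the extreme value seen in the data.
ages = [0, 0, 0, 1, 2, 3, 5, 5, 8, 10, 122]
print(iqr_outliers(ages))
```

The factor `k=1.5` is the conventional whisker length; a larger value flags fewer points.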

Creating a countplot for location¶

The countplot shows the number of villas in each location. Lohegaon, Wagholi, Maval and Bavdhan have the highest counts.¶

Creating a distribution plot for area¶

The distribution plot shows the distribution and skewness. From the distplot we can see that area is right-skewed, with a skewness of 1.68. A second look at the data shows that 4 rows have an area greater than 9000.¶

Skewness is : 1.6798244071303055
[Figure: Analysis of Area - Skewness]
location area price price_currency status new/resale price_negotiable description facing furnished ... Intercom Indoor Games Maintenance Staff Multipurpose Room ATM Cafeteria Staff Quarter Hospital School Shopping Mall
48 Magarpatta 10000 130000000 INR 1.0 0 0 A spacious 8 bhk villa is available for sale i... unknown 1 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
322 Tungarli 11000 60000000 INR 0.0 0 0 Well designed 6 bhk villa is available at a pr... northeast 1 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
805 Khandala 11500 200000000 INR 0.0 0 1 Well designed 7 bhk villa is available at a pr... east 0 ... 0.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0
896 Valvan Lonavla 11000 70000000 INR 0.0 0 1 It’s a 6 bhk villa situated in Valvan Lonavla.... east 1 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

4 rows × 38 columns

Creating a distribution plot for price¶

The distribution plot shows the distribution and skewness. From the distplot we can see that price is right-skewed, with a skewness of 3.45.¶

Skewness is : 3.446796049159059
[Figure: Analysis of Price - Skewness]
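To make the skewness figures more concrete, here is a minimal sketch of a bias-corrected sample skewness (the adjusted Fisher-Pearson coefficient). It assumes the reported values come from an estimator of this kind, which is what pandas' `Series.skew` computes; the notebook's actual call is not shown in the source, and the sample data is made up.

```python
import math


def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness; needs n >= 3 and non-constant data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    g1 = m3 / m2 ** 1.5                            # biased skewness
    return math.sqrt(n * (n - 1)) / (n - 2) * g1   # bias correction


# Made-up values with one large observation pulling the right tail out.
right_tailed = [1, 1, 2, 2, 3, 20]
print(round(sample_skewness(right_tailed), 2))
```

A positive result means a longer right tail (as for area and price), a negative result a longer left tail, and a value near zero a roughly symmetric distribution.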

Creating a countplot for facing¶

The countplot shows the number of villas for each facing direction. East has the highest count, followed by unknown, which marks rows where the facing was not recorded.¶

[Figure: Analysis of facing]

Creating a countplot for age of property¶

The countplot shows the number of villas for each age of property.¶

From the countplot, most villas are under a year old, meaning they are quite new. One villa is listed as 122 years old; it is likely an outlier, as the boxplot showed earlier.¶

[Figure: Analysis of age of property]

Creating a distribution plot for locality score¶

The distribution plot shows the distribution and skewness. From the distplot we can see that the locality score is roughly bell-shaped but left-skewed, with a skewness of about -1.37.¶

Skewness is : -1.3706736591442539
[Figure: Analysis of Locality score - Skewness]

Creating a distribution plot for project score¶

The distribution plot shows the distribution and skewness. From the distplot we can see that the project score is close to a normal distribution with a moderate left skew (skewness of about -0.75).¶

Skewness is : -0.7508526031710991
[Figure: Analysis of Project score - Skewness]

Creating a distribution plot for builder experience¶

The distribution plot shows the distribution and skewness. From the distplot we can see that builder experience is right-skewed, with a skewness of 5.17. It also confirms the boxplot we saw earlier, which shows an outlier near the 2,000 mark.¶

Skewness is : 5.172789603406742
[Figure: Analysis of Builder Experience - Skewness]

Creating a countplot for new/resale¶

The countplot shows the number of villas in each new/resale category. Label 0 is a new sale and label 1 is a resale. More than half of the villas are new sales.¶

[Figure: Analysis of new/resale]
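The percentages quoted for the 0/1 columns in this section are simply label shares. For example, the summary table gives a mean of about 0.46 for new/resale, so roughly 46% of villas are resales and 54% are new. A minimal sketch of that calculation, using made-up flags and a hypothetical `label_shares` helper:

```python
from collections import Counter


def label_shares(labels):
    """Return {label: percentage of rows} for a list of category labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: 100 * count / total for label, count in counts.items()}


# Made-up 0/1 flags, not rows from the villa dataset.
flags = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
shares = label_shares(flags)
print(shares)

# For a 0/1 column the mean equals the share of rows labelled 1.
print(100 * sum(flags) / len(flags))
```

Reading shares off the column mean is a quick way to cross-check what a countplot appears to show.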

Creating a countplot for price currency¶

The countplot shows the number of villas for each price currency. Only one currency appears: INR, the Indian rupee.¶

[Figure: Analysis of price currency]

Creating a countplot for price negotiable¶

The countplot shows whether each villa's price is negotiable. Label 0 means the price is non-negotiable and label 1 means it is negotiable. About 95% of the villas have a non-negotiable price.¶

[Figure: price negotiable]

Creating a countplot for status¶

The countplot shows the number of villas in each status. Label 0 means the villa is not ready and label 1 means it is ready. About 87% of the villas are not ready.¶

[Figure: Analysis of status]

Creating a countplot for furnished¶

The countplot shows whether a villa is furnished or not. Label 0 means the villa is not furnished and label 1 means it is furnished. About 82% of the villas are not furnished.¶

[Figure: Analysis of furnished]

Creating a countplot for gymnasium¶

The countplot shows how many villas have a gym. Label 0 means the villa has no gym and label 1 means it has one. About 53% of the villas have no gym.¶

[Figure: Analysis of Gymnasium]

Creating a countplot for 24 X 7 Security¶

The countplot shows how many villas have 24 X 7 security. Label 0 means the villa has no 24 X 7 security and label 1 means it has it. About 69% of the villas have no 24 X 7 security.¶

[Figure: Analysis of 24 X 7 Security]

Creating a countplot for swimming pool¶

The countplot shows how many villas have a swimming pool. Label 0 means the villa has no swimming pool and label 1 means it has one. About 53% of the villas have no swimming pool.¶

[Figure: Analysis of Swimming Pool]

Creating a countplot for lift(s)¶

The countplot shows how many villas have a lift. Label 0 means the villa has no lift and label 1 means it has one. About 62% of the villas have no lift.¶

[Figure: Analysis of Lift(s)]

Creating a countplot for jogging track¶

The countplot shows how many villas have a jogging track. Label 0 means the villa has no jogging track and label 1 means it has one. About 71% of the villas have no jogging track.¶

[Figure: Analysis of Jogging Track]

Creating a countplot for club house¶

The countplot shows how many villas have a club house. Label 0 means the villa has no club house and label 1 means it has one. About 51% of the villas have a club house, so slightly under half have none.¶

[Figure: Analysis of Club House]

Creating a countplot for landscaped gardens¶

The countplot shows how many villas have a landscaped garden. Label 0 means the villa has no landscaped garden and label 1 means it has one. About 89% of the villas have no landscaped garden.¶

[Figure: Analysis of Landscaped Gardens]

Creating a countplot for children's play area¶

The countplot shows how many villas have a play area for children. Label 0 means the villa has no play area and label 1 means it has one. About 71% of the villas have a play area, so about 29% have none.¶

[Figure: Analysis of Children's play area]

Creating a countplot for sports facility¶

The countplot shows how many villas have a sports facility. Label 0 means the villa has no sports facility and label 1 means it has one. About 60% of the villas have no sports facility.¶

[Figure: Analysis of Sports Facility]

Creating a countplot for car parking¶

The countplot shows how many villas have a car parking space. Label 0 means the villa has no parking space and label 1 means it has one. About 67% of the villas have no car parking.¶

[Figure: Analysis of Car Parking]

Creating a countplot for full power backup¶

The countplot shows how many villas have a full power backup. Label 0 means the villa has no full power backup and label 1 means it has one. About 37% of the villas have no power backup.¶

[Figure: Analysis of Full Power Backup]

Creating a countplot for indoor games¶

The countplot shows how many villas have indoor games. Label 0 means the villa has no indoor games and label 1 means it has them. About 83% of the villas have no indoor games.¶

[Figure: Analysis of Indoor Games]

Creating a countplot for Intercom¶

The countplot shows how many villas have an intercom. Label 0 means the villa has no intercom and label 1 means it has one. About 52% of the villas have no intercom.¶

[Figure: Analysis of Intercom]

Creating a countplot for maintenance staff¶

The countplot shows how many villas have maintenance staff. Label 0 means the villa has no maintenance staff and label 1 means it has them. About 89% of the villas have no maintenance staff.¶

[Figure: Analysis of Maintenance Staff]

Creating a countplot for shopping mall¶

The countplot shows how many villas have a shopping mall. Label 0 means the villa has no shopping mall and label 1 means it has one. About 96% of the villas have no shopping mall.¶

[Figure: Analysis of Shopping Mall]

Creating a countplot for cafeteria¶

The countplot shows how many villas have a cafeteria. Label 0 means the villa has no cafeteria and label 1 means it has one. About 93% of the villas have no cafeteria.¶

[Figure: Analysis of Cafeteria]

Creating a countplot for ATM¶

The countplot shows how many villas have an ATM. Label 0 means the villa has no ATM and label 1 means it has one. About 95% of the villas have no ATM.¶

[Figure: Analysis of ATM]

Creating a countplot for Multipurpose room¶

The countplot shows how many villas have a multipurpose room. Label 0 means the villa has no multipurpose room and label 1 means it has one. About 86% of the villas have no multipurpose room.¶

[Figure: Analysis of Multipurpose Room]

Creating a countplot for school¶

The countplot shows how many villas have a school. Label 0 means the villa has no school and label 1 means it has one. About 97% of the villas have no school.¶

[Figure: Analysis of School]

Creating a countplot for hospital¶

The countplot shows how many villas have a hospital. Label 0 means the villa has no hospital and label 1 means it has one. About 97% of the villas have no hospital.¶

[Figure: Analysis of Hospital]

Creating a countplot for golf course¶

The countplot shows how many villas have a golf course. Label 0 means the villa has no golf course and label 1 means it has one. About 96% of the villas have no golf course.¶

[Figure: Analysis of Golf Course]

Creating a countplot for staff quarter¶

The countplot shows how many villas have a staff quarter. Label 0 means the villa has no staff quarter and label 1 means it has one. About 93% of the villas have no staff quarter.¶

[Figure: Analysis of Staff Quarter]

Bivariate Analysis¶

Bivariate analysis involves the analysis of two variables, for the purpose of determining the empirical relationship between them.

Creating a pairplot for the dataframe¶

The pairplot shows the relationship between every pair of variables.

[Figure: Pairplot]

From the pair plot above, we observed the following:

  1. price: price distribution is Right-Skewed
  2. area: area distribution is Right-Skewed
  3. new/resale: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  4. status: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  5. price_negotiable: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  6. age of property: No clear relationship with price or any other feature.
  7. lift: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  8. full power backup: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  9. 24 X 7 security: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  10. childrens play area: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  11. club house: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable

From the pair plot above, we observed the following (continued):

  1. gymnasium: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  2. swimming pool: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  3. sports facility: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  4. builder experience: No clear relationship with price or any other feature.
  5. jogging track: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  6. locality score: Its plot with price shows no linear relationship. The distribution appears to have several peaks (a mixture of Gaussians).
  7. project score: Its plot with price shows no linear relationship. The distribution appears to have several peaks (a mixture of Gaussians).
  8. landscape gardens: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  9. rain water harvesting: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  10. car parking: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  11. vaastu compliant: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable

From the pair plot above, we observed the following (continued):

  1. shopping mall: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  2. school: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  3. hospital: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  4. cafeteria: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  5. staff quarter: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  6. ATM: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  7. multipurpose room: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  8. maintenance staff: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  9. indoor games: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  10. intercom: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable
  11. Golf course: No clear relationship with price or any other feature. 2 unique values so can be converted to Categorical Variable

Creating a correlation for the dataframe¶

Correlation is primarily concerned with finding out whether a relationship exists between variables and then determining the magnitude and direction of that relationship.

area price status new/resale price_negotiable furnished age of property Lift(s) Full Power Backup 24 X 7 Security ... Intercom Indoor Games Maintenance Staff Multipurpose Room ATM Cafeteria Staff Quarter Hospital School Shopping Mall
area 1.000000 0.793453 0.097484 -0.307463 0.102195 0.170988 0.140596 0.205812 0.254673 0.190184 ... 0.315960 0.162077 0.065576 0.102418 0.034531 0.164943 0.065662 -0.033823 -0.027773 -0.026709
price 0.793453 1.000000 0.112521 -0.318050 0.076041 0.173623 0.093551 0.127169 0.180562 0.088652 ... 0.204398 0.074818 0.039846 0.009506 0.024413 0.108878 0.028044 -0.009604 -0.014931 -0.011668
status 0.097484 0.112521 1.000000 -0.223027 0.114010 0.047470 0.138892 0.014098 0.018219 0.099229 ... -0.000084 0.065624 0.126530 0.040934 -0.005364 -0.000333 -0.012530 0.056585 0.068632 0.063711
new/resale -0.307463 -0.318050 -0.223027 1.000000 0.025964 -0.331777 -0.377042 -0.085300 -0.205622 -0.198094 ... -0.266692 -0.019227 -0.043098 -0.070493 0.006840 0.145067 -0.069578 -0.029430 -0.027377 -0.051023
price_negotiable 0.102195 0.076041 0.114010 0.025964 1.000000 -0.032321 0.075315 0.020003 -0.049720 0.079163 ... -0.124906 0.036931 0.077445 0.044085 -0.004536 0.097383 0.026531 0.029508 0.088360 0.033062
furnished 0.170988 0.173623 0.047470 -0.331777 -0.032321 1.000000 0.175576 0.059782 0.142794 0.051081 ... 0.157295 -0.060622 0.102541 -0.009183 0.020468 -0.119341 0.063600 0.032227 0.024948 0.062439
age of property 0.140596 0.093551 0.138892 -0.377042 0.075315 0.175576 1.000000 0.165719 0.208526 0.399250 ... 0.252061 0.170617 0.152893 0.136514 0.092437 -0.031179 0.157386 0.126762 0.122623 0.155056
Lift(s) 0.205812 0.127169 0.014098 -0.085300 0.020003 0.059782 0.165719 1.000000 0.452385 0.444985 ... 0.569569 0.286389 0.187423 0.346355 0.277147 0.187005 0.316545 0.179299 0.148599 0.229271
Full Power Backup 0.254673 0.180562 0.018219 -0.205622 -0.049720 0.142794 0.208526 0.452385 1.000000 0.455141 ... 0.688382 0.313648 0.232003 0.292325 0.168486 0.215101 0.201109 0.110450 0.102605 0.146695
24 X 7 Security 0.190184 0.088652 0.099229 -0.198094 0.079163 0.051081 0.399250 0.444985 0.455141 1.000000 ... 0.504312 0.525083 0.352721 0.484819 0.340442 0.197785 0.395191 0.215331 0.226741 0.251413
Children's play area 0.006799 -0.065953 -0.116938 0.125220 -0.089490 -0.039244 0.092137 0.406819 0.477902 0.344974 ... 0.484545 0.287492 0.204227 0.241577 0.139453 0.180893 0.169126 0.089763 0.080282 0.108827
Club House 0.241640 0.155450 0.001334 -0.151795 -0.027283 0.069504 0.188979 0.632570 0.584850 0.503335 ... 0.750214 0.434354 0.298140 0.379059 0.218169 0.274490 0.247822 0.145360 0.152559 0.190734
Gymnasium 0.300888 0.150523 0.033737 -0.197145 -0.061365 0.067607 0.272298 0.568658 0.630950 0.554502 ... 0.794902 0.467061 0.348164 0.407138 0.198646 0.298173 0.252311 0.104403 0.101121 0.164338
Swimming Pool 0.356388 0.209707 0.026458 -0.194784 -0.020097 0.146852 0.247649 0.575833 0.655146 0.524687 ... 0.770919 0.443020 0.311915 0.375132 0.198111 0.289198 0.242885 0.117699 0.113880 0.174812
Sports Facility 0.279877 0.143833 -0.048244 -0.190436 -0.095548 0.109808 0.196537 0.574783 0.590724 0.436755 ... 0.786183 0.375521 0.171249 0.401804 0.185028 0.241149 0.266637 0.115168 0.099949 0.118227
Jogging Track 0.198497 0.086632 0.069523 -0.072862 0.044732 0.028907 0.278814 0.455393 0.429551 0.769896 ... 0.514606 0.619927 0.430206 0.526943 0.368789 0.361484 0.415188 0.257253 0.253569 0.326264
Landscaped Gardens 0.083397 0.013896 0.067590 -0.041353 0.106354 -0.015506 0.155358 0.158827 0.252946 0.337203 ... 0.243638 0.435030 0.388822 0.441362 0.120790 0.419022 0.005716 0.007905 0.023974 0.119239
locality_score 0.120238 0.173677 0.034737 -0.408020 -0.044342 0.017487 0.124899 0.034319 -0.066436 -0.035903 ... -0.064406 -0.146128 -0.219317 -0.199737 -0.059848 -0.339653 0.063032 0.055016 0.062266 0.001121
project_score 0.003288 -0.021737 -0.095453 -0.082216 -0.075545 -0.030654 -0.030605 0.101267 0.057957 0.102147 ... 0.105365 0.226262 -0.051753 0.196414 0.171820 -0.007015 0.281906 0.063541 0.092886 0.068727
builder_experience 0.030644 0.007455 0.039760 -0.033992 -0.045322 0.004896 0.044081 -0.041409 -0.011121 -0.043546 ... -0.000787 -0.107873 -0.005541 -0.150704 0.049959 -0.002207 -0.087706 0.141317 0.133889 0.077666
Rain Water Harvesting 0.050226 -0.022825 -0.005158 0.019429 0.074470 -0.047996 0.196216 0.542454 0.313762 0.681917 ... 0.405457 0.518323 0.239076 0.510036 0.357972 0.169229 0.427635 0.234451 0.232068 0.274940
Car Parking 0.172775 0.072986 0.130115 -0.157387 0.113729 0.056528 0.425569 0.445149 0.444455 0.777628 ... 0.459024 0.523367 0.462038 0.533940 0.336374 0.334181 0.369310 0.234642 0.230512 0.297588
Vaastu Compliant 0.093096 0.007914 0.043084 -0.179501 0.077673 0.045458 0.262397 0.398116 0.303161 0.579773 ... 0.356798 0.465193 0.231875 0.480669 0.544604 0.335581 0.564655 0.361125 0.377796 0.466772
Golf Course 0.138415 0.110404 0.027907 0.054725 0.041777 -0.058175 0.007930 0.196282 0.122585 0.237957 ... 0.179264 0.250690 0.009838 0.354513 0.454114 0.459312 0.209407 0.005068 0.002341 -0.010708
Intercom 0.315960 0.204398 -0.000084 -0.266692 -0.124906 0.157295 0.252061 0.569569 0.688382 0.504312 ... 1.000000 0.373019 0.261493 0.380412 0.217165 0.221958 0.251103 0.130992 0.113496 0.185284
Indoor Games 0.162077 0.074818 0.065624 -0.019227 0.036931 -0.060622 0.170617 0.286389 0.313648 0.525083 ... 0.373019 1.000000 0.318582 0.491072 0.175962 0.418346 0.308454 0.086757 0.077420 0.135253
Maintenance Staff 0.065576 0.039846 0.126530 -0.043098 0.077445 0.102541 0.152893 0.187423 0.232003 0.352721 ... 0.261493 0.318582 1.000000 0.212731 0.110363 0.322303 0.095577 0.054779 0.090898 0.248831
Multipurpose Room 0.102418 0.009506 0.040934 -0.070493 0.044085 -0.009183 0.136514 0.346355 0.292325 0.484819 ... 0.380412 0.491072 0.212731 1.000000 0.336146 0.382051 0.268544 0.205301 0.192092 0.211028
ATM 0.034531 0.024413 -0.005364 0.006840 -0.004536 0.020468 0.092437 0.277147 0.168486 0.340442 ... 0.217165 0.175962 0.110363 0.336146 1.000000 0.400737 0.472494 0.666895 0.638368 0.688177
Cafeteria 0.164943 0.108878 -0.000333 0.145067 0.097383 -0.119341 -0.031179 0.187005 0.215101 0.197785 ... 0.221958 0.418346 0.322303 0.382051 0.400737 1.000000 0.295501 0.136174 0.152340 0.191844
Staff Quarter 0.065662 0.028044 -0.012530 -0.069578 0.026531 0.063600 0.157386 0.316545 0.201109 0.395191 ... 0.251103 0.308454 0.095577 0.268544 0.472494 0.295501 1.000000 0.343404 0.353398 0.387591
Hospital -0.033823 -0.009604 0.056585 -0.029430 0.029508 0.032227 0.126762 0.179299 0.110450 0.215331 ... 0.130992 0.086757 0.054779 0.205301 0.666895 0.136174 0.343404 1.000000 0.918543 0.754209
School -0.027773 -0.014931 0.068632 -0.027377 0.088360 0.024948 0.122623 0.148599 0.102605 0.226741 ... 0.113496 0.077420 0.090898 0.192092 0.638368 0.152340 0.353398 0.918543 1.000000 0.755668
Shopping Mall -0.026709 -0.011668 0.063711 -0.051023 0.033062 0.062439 0.155056 0.229271 0.146695 0.251413 ... 0.185284 0.135253 0.248831 0.211028 0.688177 0.191844 0.387591 0.754209 0.755668 1.000000

34 rows × 34 columns
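As a rough illustration of what each entry in such a matrix measures, the sketch below computes a Pearson correlation coefficient by hand. It assumes the matrix uses Pearson's r, which is the default for pandas' `DataFrame.corr`; the source does not show the call, and the area and price values here are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up areas and prices, only to show the call.
areas = [1000, 1500, 2000, 3000, 4500]
prices = [6.0e6, 9.5e6, 1.1e7, 2.4e7, 3.0e7]
print(round(pearson(areas, prices), 3))
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate little or no linear relationship.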

From the matrix above, we found strong linear relationships between the following features:

  1. price: area
  2. area: price
  3. shopping mall: school, hospital
  4. car parking: jogging track, 24 X 7 security
  5. club house: intercom, sports facility, gymnasium, swimming pool
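One way to pull such pairs out of a correlation matrix is to filter on the absolute coefficient. The sketch below uses a handful of coefficients copied from the matrix above; the 0.75 cut-off is an assumption for illustration, since the source does not state a threshold.

```python
# A few coefficients copied from the correlation matrix above.
corr = {
    ("area", "price"): 0.793453,
    ("Hospital", "School"): 0.918543,
    ("School", "Shopping Mall"): 0.755668,
    ("Hospital", "Shopping Mall"): 0.754209,
    ("Car Parking", "24 X 7 Security"): 0.777628,
    ("Club House", "Intercom"): 0.750214,
    ("ATM", "Hospital"): 0.666895,
    ("price", "new/resale"): -0.318050,
}

THRESHOLD = 0.75  # assumed cut-off for "strong"; not stated in the source

# Keep pairs whose absolute correlation reaches the threshold, strongest first.
strong = sorted(
    (pair for pair, r in corr.items() if abs(r) >= THRESHOLD),
    key=lambda pair: -abs(corr[pair]),
)

for pair in strong:
    print(pair, corr[pair])
```

Using the absolute value means strong negative relationships would be caught as well, although none of the sampled pairs reaches the threshold in the negative direction.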

We can plot a heatmap to confirm these findings.

Creating a heatmap plot for the dataframe¶

A heatmap is a graphical representation of data in which values are depicted by color. We use it here to confirm the correlations.

[Figure: Heatmap]

Bivariate analysis of price and age of property¶

Here we look at the relationship between price and age of property. The plot shows no clear relationship between them.

[Figure: Bivariate Analysis of age of property and price]

Bivariate analysis of area and location based on facing¶

Here we look at the relationship between area and location, split by facing. The plot shows no clear relationship between them.

[Figure: Bivariate analysis of area and location based on facing]

Bivariate analysis of builder experience and price¶

Here we looking at the relationship between builder experience and price, from the plot there is no relation between them

[Figure: Bivariate Analysis of builder experience and price]
[Figure: Bivariate Analysis of builder experience < 250 and price]

Bivariate analysis of locality score and price¶

Here we look at the relationship between locality score and price. The plot shows no relationship between them.

[Figure: Bivariate Analysis of locality score and price]

Bivariate analysis of project score and price¶

Here we look at the relationship between project score and price. The plot shows no relationship between them.

[Figure: Bivariate Analysis of project score and price]

Bivariate analysis of facing and price¶

Here we look at the relationship between facing and price. The plot shows no relationship between them.

[Figure: Bivariate Analysis of facing and price]

Bivariate analysis of area and price¶

Here we look at the relationship between area and price. The plot shows a linear relationship between them: as the area increases, the price increases.

[Figure: Bivariate Analysis of area and price]

Summary¶

Of all the attributes examined, only area has a clear linear relationship with price: as the area increases, the price increases.

Modelling Section¶

Verifying the columns¶

Index(['location', 'area', 'price', 'price_currency', 'status', 'new/resale',
       'price_negotiable', 'description', 'furnished', 'age of property',
       'Lift(s)', 'Full Power Backup', '24 X 7 Security',
       'Children's play area', 'Club House', 'Gymnasium', 'Swimming Pool',
       'Sports Facility', 'Jogging Track', 'Landscaped Gardens',
       'locality_score', 'project_score', 'builder_experience',
       'Rain Water Harvesting', 'Car Parking', 'Vaastu Compliant',
       'Golf Course', 'Intercom', 'Indoor Games', 'Maintenance Staff',
       'Multipurpose Room', 'ATM', 'Cafeteria', 'Staff Quarter', 'Hospital',
       'School', 'Shopping Mall', 'east', 'north', 'northeast', 'northwest',
       'south', 'southeast', 'southwest', 'unknown', 'west'],
      dtype='object')

Checking the correlation once more using a heatmap

The correlation is used to select the desired features. Based on it, 8 columns were chosen. These features will be used for the modelling.

price area new/resale Swimming Pool Intercom Full Power Backup locality_score furnished Club House
0 6500000 1780 0 0 0.0 0 8.000000 0 0
1 17500000 2225 0 0 0.0 0 8.015749 1 0
2 6500000 1785 0 0 0.0 0 8.000000 0 0
3 15000000 1200 0 0 0.0 0 8.015749 1 0
4 35000000 4000 0 0 0.0 0 8.015749 1 0
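The notebook does not show how the 8 columns were picked from the correlation. One plausible approach, sketched here under that assumption, is to rank features by the absolute value of their correlation with `price` and keep the top `k`. The helper name `top_correlated` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd


def top_correlated(df, target="price", k=8):
    """Return the k numeric columns most correlated (in absolute value) with target."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()


# Hypothetical toy data: price depends on area, the flags are pure noise.
rng = np.random.default_rng(1)
n = 300
area = rng.uniform(500, 5000, n)
toy = pd.DataFrame({
    "area": area,
    "furnished": rng.integers(0, 2, n),
    "Club House": rng.integers(0, 2, n),
    "price": 9000 * area + rng.normal(0, 5e6, n),
})

selected = top_correlated(toy, target="price", k=2)
print(selected)
```

Ranking by absolute value keeps strongly negative predictors as well as positive ones. A fixed `k` is a simple rule of thumb and does not account for redundancy between the selected features.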

Model Selection and Train¶

Splitting the data into train and test sets. The resulting shapes of X_train, X_test, y_train and y_test are shown below.

(735, 8) (184, 8) (735,) (184,)
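The shapes above are consistent with an 80/20 split of the 919 rows. As a sketch, assuming scikit-learn's `train_test_split` with `test_size=0.2` (the exact call and `random_state` are not shown in the notebook, and the arrays here are random placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same dimensions as the selected features and target.
rng = np.random.default_rng(42)
X = rng.normal(size=(919, 8))
y = rng.normal(size=919)

# test_size=0.2 is an assumption; scikit-learn rounds the test count up (184 of 919).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```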

Model Selection¶

We select 6 models to work with, namely: Linear Regression, Ridge Regressor, Lasso Regressor, Elastic Net, SVR Regressor and Random Forest.

  1. Linear regression - models the target as a linear function of one or more features/variables and is used to predict or describe that relationship.
  2. Ridge regressor - Ridge regression is linear regression with an L2 penalty on the coefficients. It is useful for data that suffers from multicollinearity.
  3. Lasso regressor - Lasso regression is like linear regression, but it uses a technique called "shrinkage", in which the regression coefficients are shrunk towards zero by an L1 penalty.
  4. Elastic net - is a regularized regression algorithm that overcomes some limitations of the lasso (least absolute shrinkage and selection operator) method by combining its L1 penalty with the L2 penalty of ridge regression.
  5. SVR regressor - Support vector regression is a regression algorithm that supports both linear and non-linear regression, depending on the kernel.
  6. Random forest - is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
Model Name:  Linear Regression
Model:  LinearRegression()
time: 0.5953559875488281
----------------------------------
Model Name:  Ridge Regressor
Model:  Ridge()
time: 0.16841793060302734
----------------------------------
Model Name:  Lasso Regressor
Model:  Lasso()
time: 0.252666711807251
----------------------------------
Model Name:  Elastic Net
Model:  ElasticNet()
time: 0.01598834991455078
----------------------------------
Model Name:  SVR Regressor
Model:  SVR()
time: 10.357428550720215
----------------------------------
Model Name:  Random Forest Regressor
Model:  RandomForestRegressor()
time: 470.0197710990906
----------------------------------
{'model_names': dict_keys(['Linear Regression', 'Ridge Regressor', 'Lasso Regressor', 'Elastic Net', 'SVR Regressor', 'Random Forest Regressor']),
 'best_params': [{'fit_intercept': False, 'n_jobs': 2},
  {'alpha': 0.25, 'solver': 'auto'},
  {'alpha': 1.0, 'tol': 0.0001},
  {'alpha': 0.25},
  {'C': 100000, 'epsilon': 0.5, 'kernel': 'poly'},
  {'criterion': 'squared_error',
   'min_samples_leaf': 3,
   'min_samples_split': 10,
   'n_estimators': 50}],
 'best_models': [LinearRegression(fit_intercept=False, n_jobs=2),
  Ridge(alpha=0.25),
  Lasso(),
  ElasticNet(alpha=0.25),
  SVR(C=100000, epsilon=0.5, kernel='poly'),
  RandomForestRegressor(min_samples_leaf=3, min_samples_split=10, n_estimators=50)],
 'best_mae': [9160832.829474986,
  9202305.62213634,
  9149756.873674182,
  16093489.214604214,
  13902783.439677324,
  6755335.8286635885],
 'best_mse': [308640837606585.2,
  313231897628034.0,
  307359853885892.06,
  789616362996044.5,
  916945486174921.6,
  222688499449941.12],
 'best_r2': [0.6698591868880606,
  0.6649483128109922,
  0.6712294041620035,
  0.15537882106744094,
  0.019179928577758742,
  0.7617989801699765]}
For Best Model idx 
mse: 5 
mae: 5 
R2: 5
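The per-model timings and `best_params` above suggest a hyperparameter search run once per model. The search code is not part of this export, so the following is a reduced sketch assuming scikit-learn's `GridSearchCV`. The parameter grids, the synthetic data and the names `candidates` and `results` are illustrative, and only three of the six models are included to keep it short.

```python
import time

import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the villa features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 5.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 0.1, 200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grids; the notebook's actual grids are not shown.
candidates = {
    "Linear Regression": (LinearRegression(), {"fit_intercept": [True, False]}),
    "Ridge Regressor": (Ridge(), {"alpha": [0.25, 0.5, 1.0]}),
    "Lasso Regressor": (Lasso(), {"alpha": [0.01, 0.1, 1.0]}),
}

results = {"model_names": [], "best_params": [],
           "best_mae": [], "best_mse": [], "best_r2": []}

for name, (model, grid) in candidates.items():
    start = time.time()
    search = GridSearchCV(model, grid, cv=5, scoring="r2")
    search.fit(X_train, y_train)
    # Evaluate the refitted best estimator on the held-out test split.
    pred = search.best_estimator_.predict(X_test)
    results["model_names"].append(name)
    results["best_params"].append(search.best_params_)
    results["best_mae"].append(mean_absolute_error(y_test, pred))
    results["best_mse"].append(mean_squared_error(y_test, pred))
    results["best_r2"].append(r2_score(y_test, pred))
    print("Model Name: ", name, "time:", time.time() - start)

# Index of the best model by R2 (higher is better).
best_idx = int(np.argmax(results["best_r2"]))
print("Best by R2:", results["model_names"][best_idx])
```

Note that the best model is the one with the lowest MAE and MSE but the highest R2, so the index is taken with `argmin` for the error metrics and `argmax` for R2.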

Model Summary¶

Below are the summaries for the 6 models used

model_names best_params best_models best_mae best_mse best_r2
0 Linear Regression {'fit_intercept': False, 'n_jobs': 2} LinearRegression(fit_intercept=False, n_jobs=2) 9.160833e+06 3.086408e+14 0.669859
1 Ridge Regressor {'alpha': 0.25, 'solver': 'auto'} Ridge(alpha=0.25) 9.202306e+06 3.132319e+14 0.664948
2 Lasso Regressor {'alpha': 1.0, 'tol': 0.0001} Lasso() 9.149757e+06 3.073599e+14 0.671229
3 Elastic Net {'alpha': 0.25} ElasticNet(alpha=0.25) 1.609349e+07 7.896164e+14 0.155379
4 SVR Regressor {'C': 100000, 'epsilon': 0.5, 'kernel': 'poly'} SVR(C=100000, epsilon=0.5, kernel='poly') 1.390278e+07 9.169455e+14 0.019180
5 Random Forest Regressor {'criterion': 'squared_error', 'min_samples_le... (DecisionTreeRegressor(max_features='auto', mi... 6.755336e+06 2.226885e+14 0.761799

Plotting the mean square error¶

The mean squared error (MSE) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values. The SVR Regressor and Elastic Net had the highest MSE on the test data.

Plotting the mean absolute error¶

Mean absolute error (MAE) is a metric used to evaluate the performance of regression models. It is defined as the average of the absolute differences between actual and predicted values. Elastic Net and the SVR Regressor had the highest MAE on the test data.

Plotting the root mean square¶

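Root mean square error (RMSE) is a frequently used measure of the differences between the values predicted by a model or an estimator and the values actually observed. It is the square root of the MSE, so the models rank the same way on both metrics. The Random Forest Regressor, which has the lowest MSE on the test data, therefore has the lowest RMSE (about 1.49e7), and the SVR Regressor has the highest.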
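To make the three error metrics concrete, here is a small dependency-free sketch that computes them by hand on four made-up values. The function names and the numbers are illustrative only.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, expressed in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


# Made-up values; the errors are -2, 2, -3 and 5.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 35.0]

print(mse(actual, predicted))   # (4 + 4 + 9 + 25) / 4 = 10.5
print(mae(actual, predicted))   # (2 + 2 + 3 + 5) / 4 = 3.0
print(rmse(actual, predicted))  # sqrt(10.5), roughly 3.24
```

Because MSE squares each error, the single error of 5 contributes more than half of the total, which is why MSE and RMSE are more sensitive to outliers than MAE.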

Model Tuning¶

Model tuning is the systematic modification of model parameters to identify the best-performing model. Based on the best parameters found for the Random Forest Regressor, we tune the model as suggested.

{'criterion': 'squared_error', 'min_samples_leaf': 3, 'min_samples_split': 10, 'n_estimators': 50}

After tuning the parameters as suggested, the result is shown below.

0.6861450833517937